fuzzywuzzy
We cannot prevent all inconsistencies among the people who enter data
Inconsistent data entry isn’t always predictable
| School Name | Dept. of Ed. ID Number | City | State |
| School Name | Dept. of Ed. ID Number | City | State |
…and lots of other information
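One way to line up two tables like these despite inconsistent spellings is to score each name against every candidate and keep the closest one. The sketch below is only an illustration under stated assumptions, not the approach used in this presentation: it uses the standard library's `difflib` as a stand-in scorer, the helper names `similarity` and `best_match` and the threshold of 75 are hypothetical, and the school names are invented.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Case-insensitive similarity score from 0 to 100."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(name, candidates, threshold=75):
    """Return (candidate, score) for the closest candidate, or None if below threshold."""
    scored = [(candidate, similarity(name, candidate)) for candidate in candidates]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


# Hypothetical school names, invented for illustration only.
table_a = ["Lincoln High School", "Roosevelt Elementary"]
table_b = ["lincoln high school ", "Roosevelt Elem.", "Washington Middle School"]

for name in table_a:
    print(name, "->", best_match(name, table_b))
```

A threshold keeps weak matches from being accepted silently; anything that falls below it can be set aside for manual review.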
fuzzywuzzy Python package
!pip install fuzzywuzzy
My approach:
fuzzywuzzy Options
fuzz.ratio("Humpty Dumpty sat on a wall", "Humpty Dumpty Sat on a Wall!") >>> 91
fuzz.partial_ratio("Humpty Dumpty sat on a wall", "Humpty") >>> 100
fuzz.token_set_ratio("Humpty Dumpty sat on a wall", "Humpty Humpty Dumpty sat on a wall") >>> 100
fuzz.token_sort_ratio("Humpty Dumpty sat on a wall", "Dumpty Humpty wall on sat a") >>> 100
Source: Jash Data Sciences
!pip install fuzzywuzzy
from fuzzywuzzy import fuzz
# Two strings to compare
str1 = "Humpty"
str2 = "Humpty!"
# Calculate fuzz ratio
simple_ratio = fuzz.ratio(str1, str2)
print(f"The fuzz ratio is {simple_ratio}")
partial_ratio = fuzz.partial_ratio(str1, str2)
print(f"The fuzz partial ratio is {partial_ratio}")
token_set_ratio = fuzz.token_set_ratio(str1, str2)
print(f"The fuzz token set ratio is {token_set_ratio}")
token_sort_ratio = fuzz.token_sort_ratio(str1, str2)
print(f"The fuzz token sort ratio is {token_sort_ratio}")
Requirement already satisfied: fuzzywuzzy in /opt/anaconda3/lib/python3.9/site-packages (0.18.0)
The fuzz ratio is 92
The fuzz partial ratio is 100
The fuzz token set ratio is 100
The fuzz token sort ratio is 100
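The gap between 92 and 100 above comes from preprocessing: the plain ratio compares raw characters, so the trailing "!" counts against the match, while the token-based scorers normalize the text first. The sketch below approximates that behavior with the standard library only. It is not fuzzywuzzy's actual implementation; the helper names and the exact normalization rule are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> int:
    """Raw character-level similarity from 0 to 100, with no preprocessing."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def normalize(text: str) -> str:
    """Lowercase the text and replace runs of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def token_sort_ratio(a: str, b: str) -> int:
    """Compare the strings after normalizing them and sorting their words."""
    sorted_a = " ".join(sorted(normalize(a).split()))
    sorted_b = " ".join(sorted(normalize(b).split()))
    return simple_ratio(sorted_a, sorted_b)


print(simple_ratio("Humpty", "Humpty!"))      # 92
print(token_sort_ratio("Humpty", "Humpty!"))  # 100
```

Because the words are sorted before comparison, word order stops mattering, which is why "Dumpty Humpty wall on sat a" can still score 100 against the original sentence.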
The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to turn one string into another
| String1 | String2 | Levenshtein Distance | Fuzz Ratio |
|---|---|---|---|
| CAT | PAT | 1 | 67 |
| DOG | FOG | 1 | 67 |
| APPLE | APPEAL | 3 | 73 |
| PYTHON | JAVASCRIPT | 10 | 25 |
| CAR | CARROT | 3 | 67 |
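The distance column can be reproduced with the classic dynamic-programming recurrence. The function below is a minimal sketch of that textbook algorithm, written for readability; the name `levenshtein` is chosen here for illustration and is not a fuzzywuzzy call.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn source into target."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into an empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                      # delete s_char
                current[j - 1] + 1,                   # insert t_char
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


print(levenshtein("CAT", "PAT"))     # 1
print(levenshtein("CAR", "CARROT"))  # 3
```

Only two rows of the table are kept in memory at a time, so the memory cost grows with the length of the target string rather than with the product of both lengths.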
Access presentation: